An empirical study on Principal Component Analysis for clustering gene expression data
نویسندگان
چکیده
There is a great need to develop analytical methodology to analyze and to exploit the information contained in gene expression data. Because of the large number of genes and the complexity of biological networks, clustering is a useful exploratory technique for analysis of gene expression data. Other classical techniques, such as principal component analysis (PCA), have also been applied to analyze gene expression data. Using different data analysis techniques and different clustering algorithms to analyze the same data set can lead to very different conclusions. Our goal is to study the effectiveness of principal components (PC’s) in capturing cluster structure. In other words, we empirically compared the quality of clusters obtained from the original data set to the quality of clusters obtained from clustering the PC’s using both real gene expression data sets and synthetic data sets. Our empirical study showed that clustering with the PC’s instead of the original variables does not necessarily improve cluster quality. In particular, the first few PC’s (which contain most of the variation in the data) do not necessarily capture most of the cluster structure. We also showed that clustering with PC’s has different impact on different algorithms and different similarity metrics.
منابع مشابه
An Empirical Comparison between Grade of Membership and Principal Component Analysis
t is the purpose of this paper to contribute to the discussion initiated byWachter about the parallelism between principal component (PC) and atypological grade of membership (GoM) analysis. The author testedempirically the close relationship between both analysis in a lowdimensional framework comprising up to nine dichotomous variables and twotypologies. Our contribution to the subject is also...
متن کاملPrincipal component analysis for clustering gene expression data
MOTIVATION There is a great need to develop analytical methodology to analyze and to exploit the information contained in gene expression data. Because of the large number of genes and the complexity of biological networks, clustering is a useful exploratory technique for analysis of gene expression data. Other classical techniques, such as principal component analysis (PCA), have also been app...
متن کاملTree-Dependent Components of Gene Expression Data for Clustering
Tree-dependent component analysis (TCA) is a generalization of independent component analysis (ICA), the goal of which is to model the multivariate data by a linear transformation of latent variables, while latent variables fit by a tree-structured graphical model. In contrast to ICA, TCA allows dependent structure of latent variables and also consider non-spanning trees (forests). In this pape...
متن کاملModification of the Fast Global K-means Using a Fuzzy Relation with Application in Microarray Data Analysis
Recognizing genes with distinctive expression levels can help in prevention, diagnosis and treatment of the diseases at the genomic level. In this paper, fast Global k-means (fast GKM) is developed for clustering the gene expression datasets. Fast GKM is a significant improvement of the k-means clustering method. It is an incremental clustering method which starts with one cluster. Iteratively ...
متن کاملChoosing the Best Hierarchical Clustering Technique Based on Principal Components Analysis for Suspended Sediment Load Estimation
1- INTRODUCTION The assessment of watershed sediment load is necessary for controling soil erosion and reducing the potential of sediment production. Different estimates of sediment amounts along with the lack of long-term measurements limits the accessibility to reliable data series of erosion rate and sediment yield. Therefore, the observed data of suspended sediment load could be used to ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2001